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Summary. This is a collection of discussions of 'Riemann manifold Langevin and 
Hamiltonian Monte Carlo methods" by Girolami and Calderhead, to appear in the 
Journal of the Royal Statistical Society, Series B. 

1 . Connections with optimisation (S. Barthelme and N. Chopin) 

One of the many things we like about this paper is that it forces us to change 
our perspective on Metropolis-Hastings. Wc may not the only ones witli the toy 
example of a bivariate, strongly correlated, Gaussian distribution imprinted in 
our brain. This example explains well why taking correlations into account is 
important. However, one often forgets that, contrary to the Gaussian example, 
the curvature of the log target density may be far from constant, which justifies 
a local calibration of HM strategies. The authors give compelling evidence that 
local calibration may lead to strong improvements in large-dimensional problems. 

There are two ways to understand these results. One of them, put forward 
in this paper, stems from the information geometry perspective: the parameter 
space is endowed with a metric defined by G{9), which turns the posterior dis- 
tribution into a density over a manifold. The general MM ALA algorithm based 
on a diffusion over that manifold is a beautiful mathematical device, but it is 
not immediately apparent how this leads to improved (relative) MCMC perfor- 
mance. A different viewpoint proceeds from optimisation: MMALA performs 
better because it uses a better local model of the posterior density. 

As often pointed out, the Langevin proposal is a noisy version of a gradi- 
ent ascent step. Similarly, the simplified MMALA step is a noisy version of 
a (quasi-)Newton step, in which the Hessian is replaced with the Fisher infor- 
mation matrix, an idea known as Itcrativcly Reweighted Least-Squares in the 
literature on Generalised Linear Models. It is worth emphasising the fact that 



the simplified versions, which just rcUcs on these local curvature ideas, but do 
not require third derivatives, do better in terms of relative efficiency (not to 
mention in terms of human computation time!). 

This suggests two avenues for further research. First, many optimisation 
methods have been developed that only require evaluating the gradient. This 
may be more convenient from the practitioner's point of view, and it also proves 
more effective whenever computing Hessian matrices is expensive. Methods, 
such as the BFGS or Barzilai-Borwein, approximate the Hessian locally from the 
previous k iterations. Our preliminary experiments indicate that these methods 
may reduce the correlation in MCMC chains. 

The second point is that the auxiliary Gaussian distribution is merely a choice 
imposed by the physical interpretation of the Hamiltonian. Do the authors have 
any intuition on what would be the optimal auxiliary distribution? 

2. Multiple potentials (M. Beffy and CP. Robert) 

The paper gives a very clear geometric motivation for the use of an Hamiltonian 
representation. As such, it suggests for an immediate generalisation by extending 
the Hamiltonian dynamic to a more general dynamic on the level sets of 

jr(e,Pi,P2, . . . ,Pfc) = -m + ^ log{(27r)^|Gi(0)|} + ^Pi^Gi(0)-ipi 

+ I log{(27r)^|G2 W|} + ^P2^G2 W-ip2 + . . . 

+ ^ log{(2^)^|Gfe(0)|} + ^Pk^Gfe(^)-ipfc , 

where the p^-s are auxiliary vectors of the same dimension D as 9 and the Gj(^)s 
are symmetric matrices. This function is then associated with the pde's 

in that those moves preserve the potential Jf{6, pi, . . . , p^) and hence the target 
distribution at all times t. This generalisation would allow for using a range of 
information matrices Gj(^)s in parallel. The corresponding RHMC implemen- 
tation is to pick one of the indices j at random and to follow the same moves as 
in the paper, given the separation between the different energies. 

3. Information approximations (A. Doucet, P. Jacob and A.M. Johansen) 

Congratulations to the authors for their elegant contribution. 
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Consider those situations in which one does not have direct access to an 
appropriate metric but can obtain pointwise, simulation-based estimates of its 
values. For example, we might be interested in performing Bayesian inference 
in general state-space Hidden Markov Models (HMM) using particle MCMC 



methods (Andrieu et al. 2010). In this context, we integrate out numerically 
the latent variables of the model using a Sequential Monte Carlo (SMC) scheme. 
A sensible metric to use is the observed information matrix which can also be 



estimated using SMC (Poyadjis et al. 2010). We discuss here the use of such 



estimates in an MCMC context. 

Assume we want to sample from a target t:{x) on X using the Metropolis- 
Hastings (M-H) algorithm. Denote the proposal's parameters (e.g. scale) r G TZ. 
Defining an extended target over X x TZ, as tt^x, r) = 'K{x)q{r\x) an algorithm 
may be defined on A" x 7?. in which both R and X are sampled. 

At iteration n -I- 1 draw X* ~ s(-|x„,r„) and R* ^ q{-\x*). Accept this 
proposal with the standard MH acceptance probability on the extended space 

Tf{x*,r*) s{xn\x*,r*)q{rn\xn) 



a{xn,rn;x* ,r*) =1 A 
=1 A 



7f(x„,r„) s{x*\xn,rn)qir*\x*) 

Tr{xn) s{x*\xn,r„)' 

Hence it is not necessary to be able to evaluate q, even pointwise, provided that 
it can be sampled from. The resulting transition is reversible on the extended 
space and admits tt as a marginal of its invariant distribution. This simple result 



is well-known: see Besag (1995 Appendix 1). 

The MMALA, with metric tensor obtained by sampling, may be justified 
using precisely the same argument: A proposal of the form of (10), may be 
implemented with a sampled estimate of the metric tensor and such gradients as 
are required (objects which can be obtained readily in settings of interest, such as 
HMMs); the extended space construction above holds with x = 9;r = (G, VG) 
and the acceptance probability remains of the same form; the constant curvature 
proposal may be implemented without the need for estimates of VG with x — 9 
and r = G. 

The HMC variant of the same is not trivial. As each step of the implicit in- 
tegrator requires access to the value of the metric at several (implicitly-defined) 
points, direct application of the above principles does not appear possible. How- 
ever, more subtle approaches can be employed. In particular one could consider 
trying to approximate the metric using the expectation of a function with respect 
to a probability measure independent of x and using common random variates 
from this measure during an HMC update. 

4. On some examples (J.-M. Marin and CP. Robert) 

This paper is a welcome addition to the recent MCMC literature and the au- 
thors are to be congratulated for linking together the two threads that are the 
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Langevin modification of the random walk Metropolis-Hastings algorithm and 
the Hamiltonian acceleration. Overall, trying to take advantage of second order 
properties of the target density n{9), just like the Langevin improvement takes 



advantage of the first order (Roberts and Tweedie 1995 Stramer and Tweedie 
1999a|b I is a natural idea which, when implementable, can obviously speed up 



convergence. This is the Langevin part, which may use a fixed metric M or a 
local metric defining a Riemann manifold, G{6). Obviously, this requires that 
the derivation of an appropriate (observed or expected) information matrix G{9) 
is feasible up to some approximation level. Or else that authoritative enough 
directions are given about the choice of an alternative G{9). 

While the logistic example used in the paper mostly is a toy problem (where 
importance sampling works extremely well, as shown in Marin and Robert [2010 ), 
the stochastic volatility model is more challenging and the fact that the Hamil- 
tonian scheme applies to the missing data (volatility) as well as to the three 
parameters of the model is quite interesting. We would thus welcome more de- 
tails on the implementation of the algorithm in such a large dimension space. 
We however wonder at the appeal of this involved scheme when considering that 
the full conditional distribution of the volatility can be simulated exactly. 



5. Moving away from continuous time (CP. Robert) 

This paper is an interesting addition to recent MCMC literature and I am eager 
to see how the community is going to react to this potential addition to the 
MCMC toolbox. I am however wondering about the impact of the paper on 
MCMC practice. Indeed, while the dynamic on the level sets of 

y^{e,p) = ~m -t- iiog{(2^)^|G(e)|} + ^pTG(0)-ip, 

where p is an auxiliary vector of dimension D, is associated with Hamilton's 
equations, in that those moves preserve the potential Jif{9^ p) and hence the 
target distribution at all times t, I argue that the transfer to the simulation side, 
i.e. the discretisation part, is not necessarily useful, or at least that it does not 
need to be so painstakingly reproducing the continuous phenomenon. 

In a continuous time-frame, the purpose of the auxiliary vector p is clearly 
to speed up the exploration of the posterior surface by taking advantage of the 
additional energy it provides. In the discrete-time universe of simulation, on 
the one hand, the fact that the discretised (Euler) approximation to Hamil- 
ton's equations are not exact nor available in closed form does not present such 
a challenge in that approximations can be corrected by a Metropolis-Hastings 
step, provided of course all terms in the Metropolis-Hastings ratio are available. 
On the other hand, the continuous-time (physical or geometric) analogy at the 
core of the Hamiltonian may be unnecessary costly when trying to carry a phys- 
ical pattern in a discrete (algorithmic) time. MCMC algorithms are not set to 
work in continuous time and therefore the invariance and stability properties 
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of the continuous-time process that motivates the method do not carry to the 
discretised version of the process. For one thing, the (continuous) time unit has 
no equivalent in discrete time. Therefore, the dynamics of the Hamiltonian do 
not tell us how long the discretised version should run, as illustrated on Figure 
[l] As a result, convergence issues (of the MCMC algorithm) should not be im- 
pacted by inexact renderings of the continuous-time process in discrete time. For 
instance, when considering the Langevin diffusion, the corresponding Langevin 
algorithm could as well use another scale rj for the gradient than the one r used 
for the noise, i.e. 

y — X* + r]VTT{x) + ret 

rather than a strict Euler discretisation where r] = A few experiments 

run in Robert and Casella (1999, Chapter 6, Section 6.5) showed that using 



a different scale rj could actually lead to improvements, even though we never 
pursued the matter any further. 
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Fig. 1. Comparison of the fits of discretised Langevin diffusions to tlie target f{x) oc 
exp(-2;'') when using a discretisation step = .01 (left) and = .0001 (right), after 
T = 10^ steps. This comparison illustrates^^he need for more time steps when using a 
smaller discretisation step. 



